Scientific Python antipatterns advent calendar day six

For today, a slightly more complicated example that looks at program design. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

Classes without methods

Python is sometimes called a multiparadigm programming language, which means that we can write programs in different styles, including an object-oriented style. Objects are a convenient way to package up data and behaviour for larger programs, but are often misused as pure data structures.

Imagine we want to store some data from my favourite example dataset, the Palmer penguins. For each penguin we need to store a species name, a flipper length, and a body mass. This feels like an object problem, so we might be tempted to write a class definition:

class Penguin:

    # flipper length in mm, body mass in g
    def __init__(self, species, flipper_length, body_mass):
        self.species = species
        self.flipper_length = flipper_length
        self.body_mass = body_mass

and then create some objects:

p1 = Penguin('Adelie', 180, 3750)
p2 = Penguin('Adelie', 160, 4000)
p3 = Penguin('Chinstrap', 198, 3200)

and then do some data processing:

# select all penguins heavier than 3.5 kg
heavy_penguins_flipper_lengths = []
for penguin in [p1, p2, p3]:
    if penguin.body_mass > 3500:
        heavy_penguins_flipper_lengths.append(penguin.flipper_length)

heavy_penguins_flipper_lengths
[180, 160]

This design works, but doesn’t really take advantage of the power of classes. Our class definition holds data but no behaviour, so we are incurring the extra computational overhead and code complexity of a class definition for no real benefit.

There are several better options. One is to simply use a list of tuples to store the data:

penguins = [
    ('Adelie', 180, 3750),
    ('Adelie', 160, 4000),
    ('Chinstrap', 198, 3200)
]

and use tuple unpacking when processing them:

heavy_penguins_flipper_lengths = []

for penguin in penguins:
    species, flipper_length, body_mass = penguin
    if body_mass > 3500:
        heavy_penguins_flipper_lengths.append(flipper_length)
        
heavy_penguins_flipper_lengths
[180, 160]
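As a small aside, the unpacking can also happen directly in the loop header. The following is a minimal sketch of the same filter written as a list comprehension; the `_` placeholder for the unused species field is just a common convention, not something the original example requires:

```python
penguins = [
    ('Adelie', 180, 3750),
    ('Adelie', 160, 4000),
    ('Chinstrap', 198, 3200),
]

# unpack each tuple in the loop header; _ marks the unused species field
heavy_penguins_flipper_lengths = [
    flipper_length
    for _, flipper_length, body_mass in penguins
    if body_mass > 3500
]

print(heavy_penguins_flipper_lengths)  # [180, 160]
```

This is shorter, but it still relies on us remembering the order of the fields in every place we process the data.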

Having to explicitly unpack the tuple into individual variables like this:

species, flipper_length, body_mass = penguin

every time we want to use them is annoying, so an even better option would be to make a named tuple:

from collections import namedtuple

# flipper length in mm, body mass in g
Penguin = namedtuple("Penguin", ["species", "flipper_length", "body_mass"])

This allows us to construct our penguin objects just like before:

penguins = [
    Penguin('Adelie', 180, 3750),
    Penguin('Adelie', 160, 4000),
    Penguin('Chinstrap', 198, 3200)
]

and use the attribute names without unpacking:

heavy_penguins_flipper_lengths = []

for penguin in penguins:
    if penguin.body_mass > 3500:
        heavy_penguins_flipper_lengths.append(penguin.flipper_length)

heavy_penguins_flipper_lengths
[180, 160]

but without having to write a full class definition or incur the computational overhead of a custom class: the data are still stored internally, very efficiently, as a tuple.
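To make the "still a tuple" point concrete, here is a short sketch of how a named tuple behaves. It reuses the `Penguin` definition from above; the values 3800 and the variable names `mutated` and `heavier` are made up for illustration:

```python
from collections import namedtuple

# flipper length in mm, body mass in g
Penguin = namedtuple("Penguin", ["species", "flipper_length", "body_mass"])
p = Penguin('Adelie', 180, 3750)

# a named tuple is still a tuple, so indexing and unpacking keep working
print(isinstance(p, tuple))  # True
print(p[1])                  # 180
species, flipper_length, body_mass = p

# instances carry no per-instance attribute dictionary
print(hasattr(p, '__dict__'))  # False

# like any tuple, a named tuple is immutable
try:
    p.body_mass = 3800
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False

# _replace builds a modified copy and leaves the original untouched
heavier = p._replace(body_mass=3800)
print(heavier.body_mass, p.body_mass)  # 3800 3750
```

The immutability is worth keeping in mind: if the processing needs to update a penguin in place, a named tuple is probably not the right fit.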

One downside of the named tuple approach is that if we later decide that we want to add some methods, we have to find the named tuple definition and replace it with a class definition. An alternative to the named tuple would be a dataclass:

from dataclasses import dataclass

@dataclass
class Penguin:
    species: str
    flipper_length: int  # in mm
    body_mass: int       # in g

This gives us an alternative way of defining classes that will be used mostly for storing attributes, and we can still use our nice syntax as before:

penguins = [
    Penguin('Adelie', 180, 3750),
    Penguin('Adelie', 160, 4000),
    Penguin('Chinstrap', 198, 3200)
]

heavy_penguins_flipper_lengths = []

for penguin in penguins:
    if penguin.body_mass > 3500:
        heavy_penguins_flipper_lengths.append(penguin.flipper_length)

heavy_penguins_flipper_lengths
[180, 160]

but keeps the option of adding methods to the class later on if we want to:

@dataclass
class Penguin:
    species: str
    flipper_length: int  # in mm
    body_mass: int       # in g

    def body_mass_kg(self):
        return self.body_mass / 1000.0



p = Penguin(species="Adelie", flipper_length=181, body_mass=3750)

p.body_mass_kg()
3.75
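Besides leaving room for methods, the `@dataclass` decorator also generates a few methods for us, including `__init__`, `__repr__` and `__eq__`. The sketch below shows what that looks like in practice; the two identical penguins and the value 3800 are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Penguin:
    species: str
    flipper_length: int  # in mm
    body_mass: int       # in g

p1 = Penguin('Adelie', 180, 3750)
p2 = Penguin('Adelie', 180, 3750)

# the generated __repr__ lists every field by name
print(p1)  # Penguin(species='Adelie', flipper_length=180, body_mass=3750)

# the generated __eq__ compares objects field by field
print(p1 == p2)  # True

# unlike a named tuple, a default dataclass is mutable
p1.body_mass = 3800
print(p1 == p2)  # False
```

If we would rather have the immutability of a named tuple, passing `frozen=True` to the decorator (`@dataclass(frozen=True)`) makes attribute assignment raise an error.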

One final option: if we are working in an environment where we can install packages, we could use an existing library to handle the data storage. If we represented our penguins as rows in a pandas dataframe:

import pandas as pd

penguins = pd.DataFrame(
    [
        ("Adelie", 180, 3750),
        ("Adelie", 160, 4000),
        ("Chinstrap", 198, 3200),
    ],
    columns=["species", "flipper_length", "body_mass"],
)

penguins
     species  flipper_length  body_mass
0     Adelie             180       3750
1     Adelie             160       4000
2  Chinstrap             198       3200

then we would have access to all the usual pandas tools for filtering, etc.:

penguins[penguins['body_mass'] > 3500]['flipper_length']
0    180
1    160
Name: flipper_length, dtype: int64
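As a rough idea of what those other tools look like, here is a small sketch that assumes pandas is installed. It repeats the filter with a single `.loc` call and then computes a per-species summary; the variable names `heavy` and `mean_mass` are made up for illustration:

```python
import pandas as pd

penguins = pd.DataFrame(
    [
        ("Adelie", 180, 3750),
        ("Adelie", 160, 4000),
        ("Chinstrap", 198, 3200),
    ],
    columns=["species", "flipper_length", "body_mass"],
)

# the same filter, written as one .loc call: row mask first, then column
heavy = penguins.loc[penguins["body_mass"] > 3500, "flipper_length"]
print(heavy.tolist())  # [180, 160]

# mean body mass (in g) for each species
mean_mass = penguins.groupby("species")["body_mass"].mean()
print(mean_mass["Adelie"])     # 3875.0
print(mean_mass["Chinstrap"])  # 3200.0
```

Summaries like this would need an explicit loop and some bookkeeping with any of the pure Python options above.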

One more time; if you want to see the rest of these little write-ups, sign up for the mailing list:

Sign up for the mailing list